hop length
Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
Lay, Bunlong, Makarov, Rostislav, Gerkmann, Timo
Diffusion models are a class of generative models that have recently been used for speech enhancement with remarkable success but are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency on the order of 0.3 to 1 second. This marks the first practical diffusion-based solution for online speech enhancement.
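The buffer idea in the abstract above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the linear noise schedule, the function names, the latency formula (buffer size times hop length divided by sample rate, ignoring window length and compute time), and the example values of 20 frames, hop length 256, and 16 kHz.

```python
from collections import deque


def noise_levels(buffer_size, t_max=1.0):
    """Assumed linear schedule: the oldest slot gets the least noise,
    the newest slot (closest to the present) gets t_max."""
    return [t_max * (i + 1) / buffer_size for i in range(buffer_size)]


def latency_seconds(buffer_size, hop_length, sample_rate):
    """Assumed delay model: a frame leaves the buffer after buffer_size hops."""
    return buffer_size * hop_length / sample_rate


def stream(frame_ids, buffer_size):
    """Simulate the sliding buffer: each new frame enters at the noisy end,
    and the oldest frame is emitted once the buffer is full."""
    buffer = deque()
    emitted = []
    for frame_id in frame_ids:
        buffer.append(frame_id)
        # A real system would run one denoising step over the buffer here.
        if len(buffer) > buffer_size:
            emitted.append(buffer.popleft())
    return emitted


if __name__ == "__main__":
    print(noise_levels(4))
    print(latency_seconds(buffer_size=20, hop_length=256, sample_rate=16000))
    print(stream(range(6), buffer_size=4))
```

Under these assumed values the delay is 0.32 s. Doubling the buffer size doubles the delay, which is the performance/latency trade-off the abstract describes.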
Skeleton-Guided Learning for Shortest Path Search
Liu, Tiantian, Li, Xiao, Li, Huan, Lu, Hua, Jensen, Christian S., Xu, Jianliang
Shortest path search is a core operation in graph-based applications, yet existing methods face important limitations. Classical algorithms such as Dijkstra's and A* become inefficient as graphs grow more complex, while index-based techniques often require substantial preprocessing and storage. Recent learning-based approaches typically focus on spatial graphs and rely on context-specific features like geographic coordinates, limiting their general applicability. We propose a versatile learning-based framework for shortest path search on generic graphs, without requiring domain-specific features. At the core of our approach is the construction of a skeleton graph that captures multi-level distance and hop information in a compact form. A Skeleton Graph Neural Network (SGNN) operates on this structure to learn node embeddings and predict distances and hop lengths between node pairs. These predictions support LSearch, a guided search algorithm that uses model-driven pruning to reduce the search space while preserving accuracy. To handle larger graphs, we introduce a hierarchical training strategy that partitions the graph into subgraphs with individually trained SGNNs. This structure enables HLSearch, an extension of our method for efficient path search across graph partitions. Experiments on five diverse real-world graphs demonstrate that our framework achieves strong performance across graph types, offering a flexible and effective solution for learning-based shortest path search.
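To make the idea of model-driven pruning concrete, here is a hedged Python sketch. It is not LSearch: it is plain Dijkstra with one added pruning test, and the `predict` callable, the `slack` parameter, and the toy graph are all hypothetical. A lookup table of true distances stands in for a trained distance model.

```python
import heapq

INF = float("inf")


def guided_search(graph, src, dst, predict, slack=1.0):
    """Dijkstra that skips any node whose distance so far plus predicted
    remaining distance exceeds slack * predict(src, dst).
    Returns (distance or None, number of expanded nodes)."""
    if src == dst:
        return 0.0, 0
    bound = slack * predict(src, dst)
    dist = {src: 0.0}
    heap = [(0.0, src)]
    expanded = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        if u == dst:
            return d, expanded
        expanded += 1
        for v, w in graph[u]:
            nd = d + w
            if nd + predict(v, dst) > bound:
                continue  # pruned by the (assumed) distance predictor
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None, expanded


# Toy directed graph: a -> b -> c -> d is the useful branch, a -> x -> y -> z is a dead end.
graph = {
    "a": [("b", 1), ("x", 1)],
    "b": [("c", 1)],
    "c": [("d", 1)],
    "d": [],
    "x": [("y", 1)],
    "y": [("z", 1)],
    "z": [],
}
# Stand-in for a trained model: exact distances to "d", infinity if unreachable.
true_to_d = {"a": 3, "b": 2, "c": 1, "d": 0}


def predict(u, v):
    return true_to_d.get(u, INF)


if __name__ == "__main__":
    print(guided_search(graph, "a", "d", predict, slack=1.0))
    print(guided_search(graph, "a", "d", predict, slack=INF))  # no pruning
```

With an exact predictor the pruned search expands only the nodes on the shortest path. With an imperfect predictor, a bound that is too tight can prune the optimal path, so a `slack` above 1 trades a larger search space for safety.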
Learnable Adaptive Time-Frequency Representation via Differentiable Short-Time Fourier Transform
Leiber, Maxime, Marnissi, Yosra, Barrau, Axel, Meignen, Sylvain, Massoulié, Laurent
The short-time Fourier transform (STFT) is widely used for analyzing non-stationary signals. However, its performance is highly sensitive to its parameters, and manual or heuristic tuning often yields suboptimal results. To overcome this limitation, we propose a unified differentiable formulation of the STFT that enables gradient-based optimization of its parameters. This approach addresses the limitations of traditional STFT parameter tuning methods, which often rely on computationally intensive discrete searches. It enables fine-tuning of the time-frequency representation (TFR) based on any desired criterion. Moreover, our approach integrates seamlessly with neural networks, allowing joint optimization of the STFT parameters and network weights. The efficacy of the proposed differentiable STFT in enhancing TFRs and improving performance in downstream tasks is demonstrated through experiments on both simulated and real-world data.
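A rough NumPy sketch of gradient-based tuning of a continuous STFT parameter follows. It is not the authors' differentiable formulation. As assumptions for illustration, it uses a Gaussian window whose width `sigma` is a continuous parameter, Shannon entropy of the normalized spectrogram as the criterion, and a central finite difference in place of an analytic or automatic gradient. All function names are hypothetical.

```python
import numpy as np


def gaussian_window(support, sigma):
    """Gaussian taper on a fixed support; sigma (in samples) is continuous."""
    t = np.arange(support) - (support - 1) / 2
    return np.exp(-0.5 * (t / sigma) ** 2)


def spectrogram(x, sigma, support=256, hop=64):
    w = gaussian_window(support, sigma)
    starts = range(0, len(x) - support + 1, hop)
    frames = np.stack([x[s:s + support] * w for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def entropy_loss(x, sigma):
    """Shannon entropy of the normalized spectrogram (lower means sparser)."""
    s = spectrogram(x, sigma)
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())


def grad_sigma(x, sigma, eps=1e-3):
    """Central finite difference, a stand-in for a true gradient."""
    return (entropy_loss(x, sigma + eps) - entropy_loss(x, sigma - eps)) / (2 * eps)


def tune_sigma(x, sigma, lr=20.0, steps=5):
    """Plain gradient descent on the window width."""
    for _ in range(steps):
        sigma -= lr * grad_sigma(x, sigma)
    return sigma


if __name__ == "__main__":
    n = np.arange(2048)
    tone = np.sin(2 * np.pi * 0.1 * n)  # stationary sinusoid
    sigma0 = 10.0
    sigma1 = tune_sigma(tone, sigma0)
    print(sigma0, entropy_loss(tone, sigma0))
    print(sigma1, entropy_loss(tone, sigma1))
```

For a stationary tone the descent widens the window, since a longer window concentrates the energy in fewer frequency bins. A signal with fast transients would pull the parameter the other way.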
Textless NLP -- Zero Resource Challenge with Low Resource Compute
Ramadass, Krithiga, Singh, Abrit Pal, J, Srihari, Kalyani, Sheetal
The availability of text data for low-resource languages has always been a challenge, and transfer learning from multilingual models has its own limitations. End-to-end spoken systems that do not involve text have received significant attention in recent years. The Zero-Resource Challenge (ZRC) [1] has enabled addressing the low-resource language representation problem and has been a significant driver in this area. In the acoustic unit discovery task for ZRC, high-dimensional input speech data is mapped to its latent representation to [...] Coding (VQ-CPC) [8] as the encoder in our speech processing pipeline. The input audio files are preprocessed and extracted as log-Mel spectrograms. The initial processing involves convolution and normalization layers to extract high-level features. These features are then passed through an auto-regressive network, which predicts future representations of the input based on past information. One of the key characteristics of VQ-CPC is its use of vector quantization as a bottleneck to discretize the continuous embeddings extracted by the autoregressive network into a finite set of discrete codes.
Optimising MFCC parameters for the automatic detection of respiratory diseases
Yan, Yuyang, Simons, Sami O., van Bemmel, Loes, Reinders, Lauren, Franssen, Frits M. E., Urovi, Visara
Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCCs) are widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy obtained with MFCCs decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCCs varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. With the optimized combination, the SVM model achieves accuracies of 81.1%, 80.6%, and 71.7% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively, improvements of 19.6%, 16.10%, and 14.90% over the worst combination.
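The role of frame length and hop length can be made concrete with a small NumPy framing sketch. The helper names, the 16 kHz sample rate, and the 25 ms / 10 ms / 20 ms settings are illustrative assumptions, not values taken from the study. The sketch covers only the framing step that precedes MFCC computation, without padding. Libraries that pad the signal will report different frame counts.

```python
import numpy as np


def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def frame_signal(signal, frame_length, hop_length):
    """Slice a 1-D signal into overlapping frames (no padding).
    Returns an array of shape (n_frames, frame_length)."""
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_length)[None, :]
    return signal[index]


if __name__ == "__main__":
    sample_rate = 16000
    signal = np.zeros(sample_rate)  # one second of placeholder audio
    frame_length = ms_to_samples(25, sample_rate)
    for hop_ms in (10, 20):
        hop_length = ms_to_samples(hop_ms, sample_rate)
        frames = frame_signal(signal, frame_length, hop_length)
        print(hop_ms, frames.shape)
```

Doubling the hop length roughly halves the number of frames, so the classifier sees a coarser feature sequence for the same recording.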
Differentiable short-time Fourier transform with respect to the hop length
Leiber, Maxime, Marnissi, Yosra, Barrau, Axel, Badaoui, Mohammed El
The short-time Fourier transform (STFT) is a frequently used tool for analyzing non-stationary digital signals in various fields including audio (Stafford et al., 1998), medicine (Huang et al., 2019), and vibration analysis (Leclère et al., 2016). Spectrograms, which are obtained from the STFT magnitude, are essential for visualizing, understanding, and processing non-stationary signals in a time-frequency representation. The STFT parameters, including the tapering function, window length, and hop length, are critical and depend on the application and signal characteristics. The tapering function balances frequency resolution and spectral leakage, with a narrower main lobe providing better frequency resolution at the expense of increased spectral leakage, and a wider main lobe reducing spectral leakage but decreasing frequency resolution. The Hann or Hamming window is a common starting point, but the best choice depends on the application's specific requirements. Most studies on STFT parameters have focused on the choice of the window length, as it determines the time-frequency resolution trade-off. A shorter window length provides better time resolution but poorer frequency resolution. Conversely, a longer window length provides better frequency resolution but poorer time resolution. To provide more precise control over temporal and frequency resolution based on the local characteristics of the input signal, researchers have proposed using variable-length windows.
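The window-length trade-off described above can be checked numerically. The following NumPy sketch uses assumed values throughout: an 8 kHz sample rate, two tones at 1000 Hz and 1040 Hz, window lengths of 64 and 1024 samples, and a simple peak counter written for this example. It analyzes a single Hann-windowed frame at each window length and counts the spectral peaks above half the maximum.

```python
import numpy as np


def count_spectral_peaks(x, win_length, rel_threshold=0.5):
    """Count local maxima of one Hann-windowed frame's magnitude spectrum
    that exceed rel_threshold times the largest bin."""
    frame = x[:win_length] * np.hanning(win_length)
    mag = np.abs(np.fft.rfft(frame))
    mid = mag[1:-1]
    is_peak = (mid > mag[:-2]) & (mid >= mag[2:]) & (mid > rel_threshold * mag.max())
    return int(is_peak.sum())


sample_rate = 8000
t = np.arange(2048) / sample_rate
# Two tones 40 Hz apart.
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1040 * t)

if __name__ == "__main__":
    for win_length in (64, 1024):
        bin_spacing_hz = sample_rate / win_length
        span_ms = 1000 * win_length / sample_rate
        peaks = count_spectral_peaks(x, win_length)
        print(win_length, bin_spacing_hz, span_ms, peaks)
```

The 64-sample window spans 8 ms but has 125 Hz bin spacing, so the two tones merge into one peak. The 1024-sample window has 7.8125 Hz bin spacing and separates them, at the cost of averaging over 128 ms of signal.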
Play It Back: Iterative Attention for Audio Recognition
Stergiou, Alexandros, Damen, Dima
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which represents higher-resolution features within these segments. We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.